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Abstract 

Tumor samples are typically lieterogeneous, containing admixture by normal, non-cancerous cells and one or more 
subpopulations of cancerous cells. Whole-genome sequencing of a tumor sample yields reads from this mixture, 
but does not directly reveal the cell of origin for each read. We introduce THetA (Tumor Heterogeneity Analysis), 
an algorithm that infers the most likely collection of genomes and their proportions in a sample, for the case 
where copy number aberrations distinguish subpopulations. THetA successfully estimates normal admixture and 
recovers clonal and subclonal copy number aberrations in real and simulated sequencing data. THetA is available 
at http://compbio.cs.brown.edu/software/. 
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Background 

Cancer is a disease driven in part by somatic mutations, 
which accumulate during the hfetime of an individual. 
The clonal theory of cancer progression [1] states that 
the cancerous cells in a tumor are descended from a 
single founder cell and that descendants of this founder 
cell acquired multiple mutations beneficial for tumor 
growth through multiple rounds of selection and clonal 
expansion. A tumor is thus a heterogeneous population 
of cells, each cell potentially containing a different com- 
plement of somatic mutations. These include both clo- 
nal mutations from the founder cell or early rounds of 
clonal expansion and subclonal mutations that occurred 
after the most recent clonal expansion. Alternatively, 
subclonal mutations may suggest that the tumor is poly- 
clonal, consisting of subpopulations of cells that are not 
all descended from a single founder cell [2]. 

High-throughput DNA sequencing technologies are 
now giving an unprecedented view of this intra-tumor 
mutational heterogeneity [3]. However, nearly all recent 
cancer sequencing projects generate DNA sequence 
from tumor samples consisting of many cells - including 
both normal (non-cancerous) cells and one or more dis- 
tinct populations of tumor cells. The tumor purity of a 
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sample is the fraction of cells in the sample that are 
cancerous, and not normal cells. If a sample has a low 
tumor purity, then the power to detect all types of 
somatic aberrations in the cancer genomes is reduced. 
For example, lower tumor purity attenuates copy num- 
ber ratios or allele frequencies away from the values 
expected with integral copy numbers. Methods to detect 
somatic copy number aberrations or loss of heterozygos- 
ity (LOH) from SNP array data or array comparative 
genomic hybridization (aCGH) data must account for 
this issue [4-9]. In addition, many algorithms for identi- 
fying somatic single-nucleotide mutations from DNA 
sequence reads implicitly or explicitly rely on an esti- 
mate of tumor purity. For example, the VarScan 2 pro- 
gram [10] uses an estimate of tumor purity as input to 
calibrate the expected number of reads that contain a 
somatic mutation at a locus. 

Traditionally, tumor purity was assessed by visual ana- 
lysis of tumor cells, either manually by a pathologist or 
via image analysis [11]. Recently, methods such as 
ASCAT [12] and ABSOLUTE [13] were introduced to 
estimate tumor purity directly from SNP array data. 
Both of these methods utilize the presence of copy 
number aberrations in cancer genomes to estimate both 
tumor purity and tumor ploidy, which is the number of 
copies of segments of chromosomes or entire chromo- 
somes. Tumor purity and tumor ploidy are intertwined; 
for example, a heterozygous deletion of one copy of a 
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chromosome in a 100% pure tumor sample (containing 
one cancer genome) could also be explained as a homo- 
zygous deletion in a 50% pure tumor sample (containing 
one cancer genome). Thus, it is necessary to estimate 
tumor purity and ploidy simultaneously, but this is a 
subtle and difficult problem. ASCAT and ABSOLUTE 
address this problem by estimating the average ploidy 
over the entire cancer genome. These estimates of 
tumor purity and average ploidy are then used in a sec- 
ond step to derive copy number aberrations. 

Both ASCAT and ABSOLUTE have been shown to 
yield accurate estimates of tumor purity, achieving in 
some cases better estimates than via pathology or other 
techniques. However, these methods also have impor- 
tant limitations. First, the mathematical models used by 
ASCAT and ABSOLUTE are optimized for SNP array 
data, as we detail below. While these methods may be 
adapted to run on DNA sequencing data (for example, 
for ABSOLUTE see [14] and for ASCAT see below), the 
underlying mathematical model used by both methods 
does not adequately describe the characteristics of 
sequencing data. Second, both of these methods apply 
various heuristics in their estimation procedures, such 
as rounding copy numbers to the closest integer [12] 
and do not directly infer integer copy numbers for each 
segment of the genome during the estimation. Finally, 
both methods do not explicitly identify multiple tumor 
subpopulations, and instead infer only a single tumor 
subpopulation. For example, ABSOLUTE [13] classifies 
copy number aberrations as outliers if they are not clo- 
nal, but does not refine these outliers into subpopula- 
tions. If a tumor sample consists of multiple tumor 
subpopulations, then considering only a single tumor 
population may yield inaccurate estimates of tumor pur- 
ity, as we show below. 

High-throughput DNA sequencing data is much 
higher resolution data than SNP arrays, and provides 
the opportunity to derive highly accurate estimates of 
both tumor purity and the composition of tumor subpo- 
pulations. For example, the number of reads containing 
a somatic single-nucleotide mutation at a locus provides 
- in principle - an estimate of the fraction of cells in a 
tumor sample containing this mutation. However, three 
interrelated factors complicate this analysis: (1) The 
number of reads supporting a somatic single-nucleotide 
mutation has high variance, implying that an estimate of 
the allele frequency will be highly unreliable at the mod- 
est coverages (30x to 40x) employed in nearly all cur- 
rent cancer sequencing projects. (2) Somatic mutations 
may be present in only a fraction of tumor cells. (3) 
Somatic copy number aberrations (nearly ubiquitous in 
solid tumors) alter the number of copies of the locus 
containing the mutation. While the first issue might be 
addressed in part by clustering allele frequency estimates 



across the genome [15-17], the second and third issues 
complicate such a clustering. Recent methods for ana- 
lyzing tumor composition from DNA sequencing data 
either ignore copy number aberrations [17] or use itera- 
tive approaches [18] or other approximations [12,13], 
and do not formally model the generation of DNA 
sequencing data from a mixture of integral copy num- 
bers for each genomic segment. 

Beyond the estimation of tumor purity and ploidy, it is 
desirable to identify subclonal aberrations, which can 
provide information on the age or history of the tumor 
[19], and can yield further insight into tumors that fail 
to respond to treatment or metastasize [19-21]. How- 
ever, even with a pure tumor sample, characterizing 
subclonal mutations is a challenge. Tolliver et al. [22] 
infer subclonal copy number aberrations by comparing 
aberrations across different individuals, thus assuming 
that the progression of somatic copy number aberrations 
is conserved across individuals. Gerlinger et al. [23] 
recently demonstrated the extent of subclonal mutations 
by sequencing multiple (spatially separated) samples 
from a tumor, complementing earlier studies of hetero- 
geneity using microarray-based techniques [24]. In 
another approach. Ding et al. [17] used a targeted ultra- 
deep sequencing (1,000 x coverage) approach to esti- 
mate allele frequencies for relapse mutations in acute 
myeloid leukemia (AML). In another recent study, Nik- 
Zainal et al. [25] used a SNP array based estimate of 
tumor purity [12] followed by extensive manual analysis 
of somatic mutations to identify a clonal (majority) 
population and a number of subclonal populations in 
each of several breast cancer genomes. Ultimately, sin- 
gle-cell sequencing techniques promise to provide a 
comprehensive view of cancer heterogeneity [26-29], but 
these techniques presently require specialized DNA 
amplification steps, which can introduce artifacts and 
also incur higher costs because they sequence many 
cells. Thus, the problem of the simultaneous estimation 
of and correction for tumor purity as well as the identi- 
fication of clonal and subclonal mutations will remain a 
challenge for the majority of cancer sequencing projects. 

In this paper, we introduce Tumor Heterogeneity Ana- 
lysis (THetA), an algorithm that infers the most likely 
collection of genomes and their proportions from high- 
throughput DNA sequencing data, in the case where 
copy number aberrations distinguish subpopulations. In 
contrast to existing methods, we formulate and optimize 
an explicit probabilistic model for the generation of the 
observed tumor sequencing data from a mixture of a nor- 
mal genome and one or more cancer genomes, each gen- 
ome containing integral copy numbers of its segments. 
Specifically, we derive and solve the maximum likelihood 
mixture decomposition problem (MLMDP) of finding a 
collection of genomes - each differing from the normal 
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genome by copy number aberrations - whose mixture 
best explains the observed sequencing data. Thus, we 
generaUze the problem of estimating tumor purity to the 
problem of determining the proportions of normal cells 
and any number of tumor subpopulations in the sample. 

Our formulation and solution of the MLMDP 
leverages the fact that copy number aberrations create a 
strong signal in DNA sequencing data: even relatively 
small copy number aberrations cause deviations in the 
alignments of thousands to millions of reads. Thus, in 
contrast to single-nucleotide mutations, where there is 
high variance in the number of reads at each position, 
many measurements (reads) are perturbed for each copy 
number aberration. Thus, each copy number aberration 
provides many data points for deconvolution of the 
tumor genome mixture. We show how to solve the 
MLMDP as a collection of convex optimization pro- 
blems. THetA is the first algorithm - to our knowledge 
- that automatically identifies subclonal copy number 
aberrations in whole-genome sequencing data from mix- 
tures of more than two genomes. Moreover, in the case 
of an admixture between a single (clonal) cancer popu- 
lation and normal cells, THetA runs in polynomial time; 
it is the first rigorous and efficient algorithm for simul- 
taneously estimating tumor purity and inferring integral 
copy numbers. 

We apply our THetA algorithm to simulated data and 
to real DNA sequencing data from breast tumors 
sequenced at approximately 188x and approximately 40 x 
coverage from [25] . We quantify the normal cell admix- 
ture in each tumor, outperforming other algorithms for 
this task. We also demonstrate that allowing only one 
tumor subpopulation may lead to highly inaccurate 
tumor purity estimates, and subsequent failure to detect 
clonal and subclonal copy number aberrations. In the 
188 X sequenced tumor, we identify both clonal and sub- 
clonal tumor cell populations, each containing unique 
copy number aberrations. Our results recapitulate most 
of the findings reported in [25] for this sample, but also 
have some distinct differences, which are supported by 
the sequencing data. In one of the 40x sequenced 
tumors, we identified two previously unreported tumor 
subpopulations, demonstrating the ability to identify 
intra-tumor heterogeneity, in particular subclonal aberra- 
tions, at the modest sequence coverages that are the cur- 
rent standard in cancer sequencing studies. 

Results 

Maximum likelihood mixture decomposition problem 

First, we will formulate the maximum likelihood mixture 
decomposition problem of finding the most likely mixture 
of tumor cell populations from a sequenced tumor sample. 
We assume that sequenced reads from a tumor sample are 
aligned to the reference human genome, the first step in 



cancer genome sequencing analysis [30,31]. Typically, a 
matched normal genome is also sequenced to distinguish 
somatic mutations from germline variants. We focus on 
copy number aberrations in order to estimate tumor purity 
and subpopulations. Thus, we assume that a cancer gen- 
ome differs from the reference genome by gains and losses 
of segments, or intervals, of the reference genome. These 
intervals are identified by examining the density, or depth, 
of reads aligning to each location in the reference [32-34], 
and/or by clustering discordant paired reads that identify 
the breakpoints of copy number aberrations or other rear- 
rangements [35-40]. Following this analysis, the reference 
genome is partitioned into a sequence I = (/i, /„) of 
non-overlapping intervals. We represent a cancer genome 
by an interval count vector c = (ci, c^), where Cj e N is 
the integer number of copies of interval Ij in the cancer 
genome. From the sequencing of a tumor sample, we 
observe a read depth vector r = (ri, r^), where r, e N is 
the number of reads with a (unique) alignment within Ij. 

A tumor sample is a mixture of cells that contain dif- 
ferent collections of somatic mutations, and in particular 
somatic copy number aberrations. We assume that the 
tumor sample is a mixture of « subpopulations, includ- 
ing a subpopulation of normal cells and one or more 
subpopulations of cancer cells. Each subpopulation has 
a distinct interval count vector representing the genome 
of the subpopulation. Thus, we represent a tumor sam- 
ple T by: (1) an w X « interval count matrix C = [c,/,], 
where Cjh & M is the number of copies of interval Ij in 
the hth distinct subpopulation; and (2) a genome mixing 
vector fi G R" where fi^ is the fraction of cells in 7" 
from the hth subpopulation. Given a read depth vector r 
derived from the sequence of T> our goal is to identify 
the underlying interval count matrix C and genome 
mixing vector fi that best describe r (Figure 1). We for- 
mulate the following problem. 

Maximum likelihood mixture decomposition pro- 
blem (MLMDP). Given an interval partition I of a 
reference genome and an associated read depth vector r 
derived from a tumor sample T, find the underlying 
interval count matrix C and genome mixing vector [i 
that maximize the likelihood -P(r|C, /<). 

In the Materials and methods section below, we derive 
the probability P(r|C, \a) in the MLMDP. In brief, under 
the usual assumptions for DNA sequencing, the probabil- 
ity pj that a read that aligns to an interval Ij is equal to the 
fraction of the total DNA in the sample originating from 
interval ly Hence, the probability P(r|C, ia) of the observed 
read depth vector r follows a multinomial distribution 
determined by the interval count matrix C and genome 
mixing vector \a. We emphasize that the multinomial dis- 
tribution models the fact that the number of reads aligning 
to each interval are not independent random variables, but 
rather are dependent on the number of copies (ploidy) of 
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Figure 1 Algorithm overview. A mixture of three subpopulations with two distinct genomes: a normal genome (represented here with one 
copy of each interval for simplicity), and an aneuploid genome with a duplication of one interval (red). If reads are distributed uniformly over 
the aggregate DNA in the sample, then the observed distribution of reads over the blue, red and yellow intervals will follow a multinomial 
distribution with parameter Here C is the interval count matrix giving the integral number of copies of each interval in each genome in the 
mixture, and fj is the genome mixing vector giving the proportion of each subpopulation in the mixture. We find the pair (C, /j) that maximizes 
the likelihood of the observed read depth vector r. 



each interval in the cancer genome(s) (please see Addi- 
tional file 1, Section A). In contrast to our probabilistic 
model for DNA sequencing data, other methods for esti- 
mating tumor purity and ploidy [12,13] do not model the 
data as an observation from an experiment. Rather, they 
assume that the observed copy number ratio of an interval 
(or probe) is the ratio of the expected value of the tumor 
copy number and the expected value of the normal copy 
number (please see Additional file 1, Section B). Thus, 
they implicitiy assume that the observed data is an average 
over many experiments. 

Solving the maximum likelihood mixture decomposition 
problem 

We now give an overview of our algorithm for solving 
the instance of the MLMDP where ^(rlC, ^) is the mul- 
tinomial probability described above. Further details are 
in the Materials and methods section. 
Restricting ttie space of interval count matrices 
In practice, the interval count matrix C is not allowed to 
be any integer-valued matrix. There are three natural 
constraints on the interval count matrix: (1) One com- 
ponent of the tumor sample is the normal genome. 
Thus, we set the first column Ci = (2, 2, 2)^, the vec- 
tor whose entries are all two. (2) The number n of sub- 
populations is less than the number m of intervals. (3) 
The copy numbers of the intervals are integers between 
0 and k, inclusive, where k > 2. We let Cm,n,k denote the 
set of all matrices satisfying these properties. 
A convex optimization algorittim 

We wish to find the interval count matrix C e Cm,n,k and 
the genome mixing vector fi that maximize the multinomial 
likelihood P{t\C, //). However, this optimization problem is 



not straightforward to solve because it contains both inte- 
ger-valued variables (entries of C) and real-valued variables 
(entries of fi). We show that a special coordinate transfor- 
mation allows the MLMDP to be solved as a disjunction of 
constrained convex optimization problems by enumerating 
the possible interval count matrices and solving a separate 
convex optimization problem for each such C (see Materi- 
als and methods). Since the number of possible matrices C 
grows exponentially with m and n, this brute-force strategy 
approach will not scale well beyond small values of n sub- 
populations and m intervals. In a special, but important, 
case where a sample contains a single clonal tumor popula- 
tion along with a normal admixture (that is, n = 2), we 
show how to further restrict the space of possible interval 
count matrices C, and obtain an efficient algorithm (poly- 
nomial time in m) for the MLMDP. The runtime for our 
algorithm depends on the number of intervals m and maxi- 
mum copy number k in the input. Simulations with m = 39 
and /c = 3 (described below) run in 1 to 2 minutes on a 
standard desktop, while increasing to /c = 5 increases the 
runtime to approximately 25 to 40 minutes. 
Selecting a solution 

Two additional issues to be addressed in deriving a solu- 
tion are: (1) how to select from multiple optimal solu- 
tions and (2) how to choose the number n of tumor 
subpopulations in the mixture. We note that tumor 
sequencing data alone does not distinguish between dif- 
ferent optimal solutions with the same maximum likeli- 
hood. In mathematical terms, this is because only the 
parameter of the multinomial distribution is identifiable 
from the observed read depth vector r. Thus, we cannot 
distinguish between pairs (C, ^) and (C, fi') of interval 
count matrices and genome mixing vectors that give the 
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same multinomial parameter. Our algorithm THetA has 
options to return the complete family of optimal solu- 
tions, or to limit to solutions with a baseline copy num- 
ber of the clonal tumor population (see Materials and 
methods). 

Regarding the second issue, note that the likelihood P 
(r|C, ^) increases as the number n of tumor subpopula- 
tions in the mixture increases: indeed the observed read 
depth vector can be fitted 'perfectly' by placing each 
copy number aberration in its own tumor subpopula- 
tion. However, mixtures with larger n also have greater 
model complexity (that is, more parameters). We use a 
model selection criterion based on the Bayesian infor- 
mation criterion (BIC) to select a model with a balance 
between higher likelihood and lower model complexity 
in order to avoid overfitting. 

Evaluation on simulated cancer genomes 
Normal admixture: single cancer genome 

Using two different sets of simulated data, we compared 
our THetA algorithm to three other methods for esti- 
mating tumor purity and ploidy: ASCAT [12], ABSO- 
LUTE [13] and CNAnorm [18]. ASCAT and 
ABSOLUTE jointly estimate tumor purity and ploidy, 
and were originally designed for SNP array data. While 
both can be adapted to run on DNA sequencing data, 
they do not formally model this type of data, as noted 
above. CNAnorm is designed for DNA sequencing data, 
but rather than allowing tumor purity and tumor ploidy 
to inform each other, it uses an iterative approach that 
separately infers purity and copy numbers. In some 
instances, CNAnorm relies on the user manually enter- 
ing the most abundant ploidy. 

As noted above, there are multiple optimal solutions 
with the same maximum likelihood. CNAnorm [18] and 
ASCAT [12] use ad hoc criteria to return only a single 
purity estimate, and ABSOLUTE [13] uses external can- 
cer karyotypes to select from multiple possible solutions. 
To compare THetA to these other methods, we must 
select a single pair (C, fi) from the set returned by 
THetA as a representative sample reconstruction. For all 
simulations, we chose the pair (C, fi) that maximizes the 
total length of all genomic intervals in the tumor gen- 
ome with copy number 2, the expected copy number of 



the normal genome for humans. We note that this deci- 
sion applies only to these simulations - for real sequen- 
cing data the set of all equally like solutions is returned 
by THetA from which a user may select one using addi- 
tional information about the sample under considera- 
tion. For further details about the other algorithms 
please see Additional file 1, Sections K and L. 

For our first set of simulations, we generated a cancer 
genome consisting of chromosome arm copy number 
aberrations. The copy number for each non-acrocentric 
chromosome arm was chosen uniformly at random 
from the range 0 (that is, homozygous deletion) through 
k > 2 (amplification), up to a specified maximum copy 
number k. While real cancer genomes may have copy 
numbers larger than the maximum value (/c = 7) consid- 
ered in these simulations, such high amplitude amplifi- 
cations are generally focal events. We emphasize that it 
is not necessary to use all copy number aberrations to 
infer the tumor composition; for example, if there are a 
sufficient number of arm-level copy number aberrations, 
these may suffice. We then created a random mixture of 
this cancer genome and a 'matched normal' genome and 
simulated a read depth vector r for the mixture, adding 
noise according to the read depth estimation error <j). 
The parameter (f) models errors in the sequencing and 
analysis process, and we estimated from real sequencing 
data that 4> is in the range from 0.01 to 0.04 (please see 
Additional file 1, Section J and Figure S3). Since the 
ASCAT algorithm uses SNP array data, we also simu- 
lated SNP array data from our mixture. Further details 
of the simulations are in Additional file 1, Section L 

Table 1 shows how the four algorithms performed on 
the simulated datasets with interval count matrix 
C G C39,2,fe and mixing vector fi and read depth estima- 
tion error (f) = 0.03. For each value of k, the maximum 
copy number, 20 simulated datasets were generated. 
The percentage correct C is the percentage of datasets 
where the inferred interval count matrix C* exactly 
equals the true simulated matrix C for the sample. The 

copy number error is |C — C*L, that is the 

m (n - 1) ' '-^ 

average error per copy number estimate made, or per 

entry in C, where error is the Euclidean distance 

between C and C*. The purity error is |/L(,2 — ^^^^ 



Table 1 Performance of the algorithms on simulated data with one tumor population (n = 2) 
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Copy number error (median) 




Purity error (median) 
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CNAnorm 
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CNAnorm 


ABSOLUTE 


THetA 


ASCAT 


CNAnorm 
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3 


100.0 


85.0 


40.0 


70.0 


0.0 


0.0 


0.103 


0.000 


0.004 


0.040 


0.068 


0.010 


4 


90.0 


55.0 


8.3 


50.0 


0.0 


0.0 


0.163 


0.013 


0.004 


0.037 


0.064 


0.010 


5 


85.0 


50.0 


6.7 


15.0 


0.0 


0.013 


0.185 


0.160 


0.004 


0.062 


0.038 


0.075 


6 


55.0 


40.0 


0.0 


15.0 


0.0 


0.026 


0.291 


0433 


0.006 


0.063 


0.066 


0.157 


7 


30.0 


15.0 


0.0 


10.0 


0.031 


0.036 


0.445 


0471 


0.005 


0.069 


0.108 


0.149 
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the distance between the true and inferred tumor purity. 
We only calculated results for CNAnorm where the 
inferred purity was <100% (there were between 12 and 
15 trials for each k). Additional file 1, Figure S4 illus- 
trates the results obtained by each algorithm on one of 
the datasets when k = 7. 

For this first set of simulations, we found that our 
THetA algorithm computes both C and ^ very accu- 
rately over a range of copy numbers k. In particular, 
THetA outperforms CNAnorm, ABSOLUTE and 
ASCAT despite the fact that ASCAT uses additional 
information (allele frequencies) that THetA does not 
consider. Even with high amplitude copy number aber- 
rations (/c = 7) THetA on average estimates tumor pur- 
ity within 0.5% of the true purity, compared to 6.9%, 
10.8% and 14.9% by ASCAT, CNAnorm and ABSO- 
LUTE respectively. Even when THetA does not estimate 
all copy numbers across the genome correctly, it esti- 
mates most copy numbers correctly and estimates the 
copy number correctly for more segments than the 
other algorithms (see Additional file 1, Figure S4). 
Further results comparing THetA to CNAnorm for dif- 
ferent read depth estimation errors are in Additional file 
1, Figure S5. 

We also compared THetA, CNAnorm, and ABSO- 
LUTE using a second set of simulated mixtures of 
tumor and normal cells created using real sequencing 
data from an AML tumor sample and matched normal 
sample (TCGA-AB-2965) from The Cancer Genome 
Atlas (TCGA) [41]. This sample was chosen due to its 
high purity (approximately 95% pure) and lack of copy 
number aberrations. We spiked 10 copy number var- 
iants of length 2.5 Mb at random non-overlapping posi- 
tions in Chr20 (excluding the centromere) into the 
tumor genome. As in the first set of simulations, the 
copy number for each variant was chosen uniformly at 
random from the range 0 (that is, homozygous deletion) 
through k > 2 (amplification), up to a specified maxi- 
mum copy number k = 5. (We did not run ASCAT on 
this simulated data since this algorithm was designed 
only for microarray data.) We again found that THetA 
outperforms both CNAnorm and ABSOLUTE on all 
measures (Figure 2). In particular, THetA estimates the 
sample purity with an order of magnitude better accu- 
racy (using the root mean squared error as a metric of 
comparison as was done for ABSOLUTE [13]), and con- 
sistently identifies more true copy number aberrations 
than the other algorithms across different purity values 
and sequencing coverage. In particular, THetA identifies 
7.4 and 2.2 more copy number aberrations, on average, 
than ABSOLUTE and CNAnorm, respectively, across all 
purity values at 30 x sequencing coverage. Even when we 
relax the requirement that a copy number aberration 
must be predicted with the correct copy number, and 



instead count any non-normal copy number as correct, 
THetA still outperforms the other algorithms (see Addi- 
tional file 1, Figure S7). Further details of the simula- 
tions are in Additional file 1, Section I. 
Mixture of tumor subpopulations 

We next evaluated the performance of THetA on a 
simulated mixture containing two subpopulations of 
tumor cells with different copy number aberrations and 
an admixture with normal cells. Thus, there were three 
distinct subpopulations in the mixture (« = 3). Our 
method for constructing the simulated data was the 
same as for the first set of simulations, as described in 
the previous section, with a fixed read depth estimation 
error of (f) = 0.02 along with a few minor changes (see 
Additional file 1, Section I). 

Table 2 shows how the three algorithms performed on 
the simulated datasets with interval count matrix C e 
Cm,3,3 and mixing vector and read depth estimation 
error (j) = 0.02. The percentage correct C and copy 
number error are defined as for Table 1. We defined the 
purity error as the distance between the true and pre- 
dicted fraction of tumor cells in the sample. Thus, pur- 
ity error is |(1 — fii) — (l — /U,*)!^, as the proportion of 
tumor cells in the sample is 1- ^i. Since CNAnorm and 
ABSOLUTE are not able to infer multiple subpopula- 
tions, their percentage correct C = 0, and we list only 
their purity estimates. We only calculated results for 
CNAnorm where the inferred purity was <100% (there 
were between 14 and 18 trials for each m). 

While the performance of THetA was less precise in esti- 
mating all copy numbers (that is, the entries in C) exactly 
than for the tumor with a normal admixture {n = 2), 
THetA maintains a good level of accuracy as the estimates 
are near the true interval copy numbers. THetA correctly 
computes on average 94% of the copy numbers across all 
subpopulations in the mixture when there are w = 12 
intervals with varying copy number in the subpopulations. 
THetA also estimates the tumor purity with good accuracy 
(within 3.6% of the true purity when m = 12), whereas 
both CNAnorm and ABSOLUTE gravely misestimate the 
tumor purity by 30.1% and 54.7%, respectively. One possi- 
ble explanation for these errors is that both of these meth- 
ods do not account for multiple subpopulations in the 
sample and therefore tend to report tumor purity as either 
the fraction of the sample representing the largest subpo- 
pulation, or as an average of the fractions of all tumor sub- 
populations. Thus, THetA successfully recovers a complex 
mixture of two tumor subpopulations and a normal cell 
admixture directly from the observed read depth. 

Results from breast cancer sequencing data 

We analyzed Illumina paired-end sequencing data from 
three breast cancer genomes and their matched normal 
samples from [25]. We downloaded the data from the 
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Figure 2 Comparison of THetA to CNAnorm and ABSOLUTE on simulated mixtures from real sequencing data (A) Comparison of true 
and inferred tumor purity by THetA, CNAnorm and ABSOLUTE on simulated mixtures of DNA sequencing data from an acute myeloid leul<emia 
sample and a matched normal sample. Gray dashed line indicates True Purity = Inferred Purity. Below each plot are the root mean squared 
errors (RMSEs) for each method. (B) Comparison of the number of copy number aberrations correctly predicted (defined as 50% reciprocal 
overlap in position and correct integral copy count) by each method for varying tumor purity and sequence coverage. Num TP is the number of 
true positive copy number aberrations predicted. In most cases, THetA outperforms both CNAnorm and ABSOLUTE. Similar results counting 
aberrations with correct position (with 50% reciprocal overlap) but allowing for difference between true and predicted copy number are in 
Additional file 1, Figure S7. RMSE: root mean squared error. 
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Table 2 Performance of the algorithms on simulated data 
with two tumor populations (n = 3) 





% correct C 


Copy number 
error (median) 


Purity error (median) 


m 


THetA 


THetA 


THetA 


CNAnorm 


ABSOLUTE 


6 


35.0 


0.118 


0.081 


0.202 


0.458 


8 


45.0 


0.075 


0.052 


0.276 


0.477 


10 


35.0 


0.071 


0.055 


0.177 


0.434 


12 


45.0 


0.059 


0.036 


0301 


0.547 



European Genome-phenome Archive (accession number 
EGAD00001000138). This includes two samples that 
were sequenced to a depth of approximately 40x cover- 
age and one sample, PD4120a, that was sequenced with 
approximately 188x coverage. We used the BIC-Seq seg- 
mentation algorithm [32] to partition the 22 autosomes 
into intervals according to read depth. We formed an 
interval count vector from all intervals longer than 50 
kb, focusing on these longer genomic intervals because 
their observed read depth will have lower variance (see 
Additional file 1, Figure Sll). Most intervals removed in 
this step are relatively short for the samples analyzed. 
For the two 40x coverage genomes changing this cutoff 
to 10 kb resulted in the same partition as when 50 kb 
was used. For the 188 x genome we only removed nine 
intervals from consideration when the threshold was 50 
kb and seven when the threshold was 10 kb. The results 
from THetA are identical for the two different sets of 
intervals. We assume that most of the tumor genome 
does not undergo copy number aberrations, and thus 
the mode of the read depth vector is a normal baseline. 
We set lower and upper bounds on the copy number 
for each interval from this baseline. For further details, 
see Additional file 1, Section N. 
Breast tumor: 188x sequence coverage 
We analyzed the 188x sequenced tumor PD4120a using 
our THetA algorithm. We consider that the mixture 
contains a normal admixture with a single tumor subpo- 
pulation (n = 2) and a normal admixture with two dis- 
tinct tumor subpopulations (« = 3). This sample was 
extensively annotated by [25] and thus acts a positive 
control for THetA. 

Table 3 shows the tumor purity and copy number 
aberrations identified by the various algorithms on the 
188x coverage breast cancer genome (sample PD4120a). 
Assuming a single tumor subpopulation admixed with 
normal cells (« = 2), THetA's estimate of tumor purity 
(65.7%) and inferred copy number aberrations are very 
close to those obtained by CNAnorm [18] (67.2%), 
ASCAT (66.0%) [12] and ABSOLUTE (65%) [13]. How- 
ever, all of these estimates are lower than the tumor 
purity of 70% reported by [25], who identified a second 
tumor subpopulation in the sample (see below). Because 



ABSOLUTE, ASCAT, CNAnorm and THetA (with n = 
2) do not model multiple tumor subpopulations, their 
reported tumor purities are an average of the fraction of 
aberrant cells amongst the different subpopulations in 
the tumor sample, and thus generally smaller than the 
tumor purity estimate obtained when we allow more 
than one tumor subpopulation (see below). In addition, 
we note that ASCAT used additional information (B- 
allele frequencies), while THetA, CNAnorm and ABSO- 
LUTE used only read depth. The identified aberrations 
do not distinguish between those in different subpopula- 
tions, but do include several previously reported in 
breast cancer [42-47]. We also ran THetA using chro- 
mosome arms as the intervals, rather than the BIC-Seq 
intervals. Using chromosome arms, we estimated a simi- 
lar sample purity of 61.7% and predicted the same set of 
copy number aberrations as with the BIC-Seq intervals. 

Assuming « = 3 subpopulations - normal cells plus 
two distinct subpopulations of cancer cells - we analyzed 
a subset of longer intervals that are most informative for 
copy number analysis (Table 3). THetA's estimate of 
72% tumor purity is slightly higher than the 70% 
reported by [25]. Moreover, THetA's estimate of tumor 
purity is higher than the approximately 65% to 67% 
tumor purity given above for ABSOLUTE, ASCAT and 
CNAnorm, three methods that assume only one tumor 
subpopulation. Our BIC model selection chose this solu- 
tion as a better representation of the data (Figure 3A), 
than the solution that only considers a mixture of nor- 
mal cells and a single tumor population. Using the n = 
3 model we identified all copy number variants identi- 
fied above for a single tumor population, plus some 
additional aberrations including subclonal deletions of 
chromosomes 8, 11, 12, 14 and 15 not identified under 
that model (nor by the other algorithms). This demon- 
strates THetA's ability to identify copy number aberra- 
tions in subpopulations of cells. While many of the 
clonal and subclonal copy number aberrations found by 
THetA are identical to those reported by [25], there are 
several notable differences including: a clonal deletion of 
16q and different classification of aberrations on chro- 
mosomes 1 and 22 as clonal vs. subclonal. Table 3 dis- 
plays the complete set of differences where aberrations 
in bold indicate a difference between our predictions 
and those reported by [25]. Aberrations reported by [25] 
include several chromosomes not considered as part of 
our analysis (2, 6, 7, 9, 18 and 21). In Table 3 italicized 
aberrations were not input to the « = 3 THetA analysis, 
and were inferred by examination of read depth ratios 
corrected for normal admixture and tumor cell fractions 
derived from THetA (see Additional file 1, Section Q). 

We investigated further the following three differences 
between our analysis and [25]: (1) clonal deletion of 
chromosome 16q, (2) clonal vs. subclonal amplification 
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Table 3 Comparison of various algorithms on the ISSx coverage breast cancer genome 

Sample PD4120a 



Algorithm % normal Clonal Subclonal (%)° 

admixture (% tumor purity)^ 



THetA, n = 2 


343% 


Del: 1p, 4q, 13, 16q, 22q 


- 


(segmentation) 




+1: Iq 
(65.7%) 




THetA, n = 2 (cliromosome arms) 


383% 


Del: 1p, 4q, 13, 16q, 22q 
+1: Iq 
(61.7%) 


- 


CNAnorm'' 


32.8% 


Del: 1p, 4q, 13, 16q, 22q 


- 


(chromosome arms) 




+1: Iq 
(67.2%) 




ASCAr 


34% 


Del: 1p, 4q, 13, 16q, 22q 




(virtual SNP array) 




+1: Iq, 17q, 18, 19, 20 
(66.0%) 




ABSOLUTE"' (segmentation) 


35% 


(65.0%) 




THetA 


28.0% 


Del: 1p, 4q, 16q, 22q12.2- 


Del: 13, 22ql 1.2-12.1 


n = 3 




13.3 

(72.0%) 


+ 1: Iq 

(61.9%) 

Del: 8, 11, 12, 14, 15 

(10.1%) 

Del: 2, 7, 4p, 6, 
9, 18, 21 


Nik-Zainal et al. (2012) 


30% 


Del: 4q 


Del: 13, t(1;22) 


[25] 




+ 1: Iq 

(70.0%) 


(47.6%) 
Tetraploid with: 



Del(-2): 2, 7 
Del(-l): 6, 8, 9, 11, 12, 
14, 15, 18, 21 
(9.8%) 



a Entries in bold are differences between THetA and [25]. Entries in italics were not input to the n = 3 THetA analysis but were inferred using THetA's output. 
^ When CNAnorm was run using BIC-Seq intervals the normal admixture was estimated at 6.7%, therefore we report results from CNAnorm using chromosome 
arms. CNAnorm does not return integer copy numbers - and thus we report aberrations where the returned copy number was within 0.15 of the nearest integer, 
other aberrations were considered inconclusive. 

^ For ASCAT we use virtual SNP array data as input. ASCAT performs its own segmentation; we list only the large aberrations. 

^ We report here the maximum likelihood solution returned by ABSOLUTE when considering only karyotypes. When considering only somatic copy number 
aberrations or a combination, ABSOLUTE infers a tetraploid solution. For this sample, ABSOLUTE returns copy numbers for only a subset of the input intervals, so 
we do not report specific copy number aberrations predicted. 



of chromosome Iq and (3) clonal vs. subclonal deletions 
in chromosome 22q. We analyzed these differences 
using two complementary approaches. First, we analyzed 
the distribution of tumor/normal read depth ratios in 50 
kb bins across the genome. This distribution contains 
distinct peaks corresponding to copy number aberra- 
tions occurring in different subpopulations. After cor- 
recting the read depth ratios for a normal admixture 
using a linear scaling (see Additional file 1, Section O), 
peaks corresponding to clonal aberrations will occur at 
ratios divisible by 0.5, whereas peaks corresponding to 
subclonal aberrations will not (Figure 3B). Second, we 
analyzed a virtual SNP array that we constructed from 
the read counts and the variant allele frequencies 



derived from aligned reads at known germline SNPs 
(see Materials and methods). Copy number aberrations 
occurring in different subpopulations appear as distinct 
clusters in a scatter plot of read count vs. variant allele 
frequencies (Figure 3C). 

The first difference we analyzed was our prediction of 
a clonal deletion of chromosome 16q, which was not 
reported by [25]. Visual inspection of the virtual SNP 
array data for chromosome 4 (predicted to be a clonal 
deletion by both methods) and chromosome 16q shows 
three distinct clusters - one for regions of normal copy 
(centered at a variant allele frequency of 0.5) and two 
clusters (positioned symmetrically around a variant allele 
frequency of 0.5) with a lower read count that indicate a 
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Figure 3 Analysis of the 188x coverage breast tumor PD4120a. (A) (Left) Read depth ratios (gray) and the copy number aberrations inferred 
by our algorithm when n = 3 including the normal population (black), dominant (clonal) tumor population (blue) and subdonal tumor 
population (red). (Right) A reconstruction of the tumor mixture with the inferred aberrations and estimated fraction of cells in each 
subpopulation. (B) Read depth ratios in 50 kb intervals after centering so chromosome 3 has a mean of 1 and correcting for 28% normal 
admixture using a simple linear scaling. (C) Virtual SNP array results showing distinct clusters of regions according to the number of reads 
containing each SNP and fraction of reads supporting the variant allele. (D) Virtual SNP array data comparing variant allele fractions and read 
counts for chromosomes 4 and 16. This data demonstrate that both chromosomes have undergone the same type of copy number aberration, 
which we predicted to be a clonal deletion in 72% of cells in the sample. (E) Virtual SNP array data for chromosomes 13 and 22. Chromosome 
22q1 1.2-12.1 and chromosome 13 appear to be affected by the same type of aberration, which we predicted to be a subdonal deletion in 
61.9% of cells in the sample. In contrast, 22q12.2-13.3 is different, and the data are consistent with a clonal deletion. See Additional file 1, Figure 
SI 3 for further details. 
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deletion (Figure 3D). These deletion clusters have vir- 
tually identical locations in the scatter plot for chromo- 
somes 4 and 16q - supporting the conclusion that these 
deletions occur in the same fraction of the tumor sam- 
ple. Comparing the difference between the observed and 
expected read depth ratios in these deletions for differ- 
ent aberration fractions (the percentage of the sample 
containing the aberration) reveals that the optimal aber- 
ration fraction for both deletions is very similar - addi- 
tional evidence that these deletions occur in the same 
fraction of the tumor sample (see Additional file 1, Sec- 
tion Q and Figure SI 2). Given the strong evidence for 
this chromosome 16 deletion, we suspect that its omis- 
sion from [25] was an oversight rather than a deficiency 
of the analysis. 

The second difference is that we predicted chromo- 
some Iq to be amplified in a subclonal population con- 
sisting of 61.9% of the cells in the sample, whereas [25] 
indicated that this aberration is clonal (occurring in 70% 
of cells in the sample). Since this was the only large 
amplification present in the sample, we were not able to 
compare its variant allele frequencies to a different 
amplification (as we did with chromosomes 4 and 16 
above). Therefore, we examined the read depth data 
more closely. Visual inspection of read depth ratios after 
adjusting for our predicted 28% normal admixture 
(Figure 3B) and the 30% normal admixture predicted by 
[25] (see Additional file 1, Figure Sll) shows that the 
corrected read depth ratios for chromosome Iq do not 
match a ratio of 1.5 well (as would be expected if there 
was a clonal amplification with copy number 3) - an 
indication that Iq is a subclonal aberration. Comparison 
of read depth ratios for Iq to other clonal aberrations 
supports our prediction that Iq is a subclonal deletion 
(see Additional file 1, Figure SI 2). 

The final difference involves chromosome 22q; we 
predicted that it contains both clonal and subclonal 
deletions, while [25] only reported subclonal events. In 
particular, [25] reported that a deletion of a derivative 
chromosome from a translocation between chromo- 
somes 1 and 22 is the rearrangement responsible for the 
subclonal deletion observed on 22q. We found that Ip 
(see Additional file 1, Figure S12) and the distal portion 
of 22q (cytogenetic bands 12.2-13.3) appear to be clonal 
deletions, while the proximal portion of 22q (cytogenetic 
bands 11.2-12.1) is a subclonal deletion. In particular, 
the read-depth/variant-allele plot from the virtual SNP 
array shows an oblong cluster for chromosome 22 that 
only partially overlaps with the cluster for chromosome 
13, a chromosome predicted by both methods to have 
undergone a subclonal deletion (Figure 3E). This evi- 
dence supports another possible sequence of rearrange- 
ments: (1) A non-reciprocal translocation occurred 
between chromosomes 1 and 22 (supported by the 



output from the GASV algorithm [48] for clustering of 
discordant reads as discussed in Additional file 1, Sec- 
tion Q) resulting in the clonal loss of Ip and 22ql2.2- 
13.3. Following this translocation, two copies of 
22qll. 2-12.1 remained. (2) One of these remaining two 
copies of 22ql 1.2- 12.1 was deleted in a subclonal popu- 
lation (see Additional file 1, Figure S13). 
Breast tumor: 40x sequence coverage 

We also analyzed two tumor samples from [25] 
sequenced at approximately 40x coverage. For sample 
PD4088a, the model of this mixture preferred by our 
model selection procedure was a single clonal tumor 
population with normal admixture fraction 41%. [25] 
also reported this sample as clonal, although they did 
not provide an estimate of tumor purity or copy number 
aberrations. Further details of the analysis of this sample 
are in Additional file 1, Section S and Figure S16. 

We analyzed sample PD4115a, sequenced at approxi- 
mately 40x coverage using THetA, again considering 
the case where the mixture contains a normal admixture 
with a single tumor subpopulation {n = 2) and a normal 
admixture with two distinct tumor subpopulations (« = 
3). Our BIC model selection chose the model where the 
sample is a mixture of normal cells and two distinct 
subpopulations of tumor cells (Figure 4A) over the 
model where the sample contains a single tumor subpo- 
pulation with a normal admixture. While [25] provided 
some analysis of aberrations in this example, they did 
not provide a complete tumor history as they did for 
the 188x coverage genome. Complete information of 
our analysis of this sample, when we consider it as a 
mixture of a single tumor subpopulation along with a 
normal admixture, is in Additional file 1, Section R and 
Figure S14. For the model considering multiple tumor 
subpopulations, we analyzed a subset of longer intervals 
that are most informative for copy number analysis 
(further details are in Materials and methods). We esti- 
mated a normal admixture of 24% (tumor purity 76%) 
and two tumor subpopulations of 43.3% and 32.7%. The 
presence of these subclonal populations is apparent 
from visual inspection of corrected read depth ratios 
after centering the distribution (ratios in chromosome 
20 - which is predicted to contain no copy number 
aberrations - are translated to have a mean ratio of 1) 
and correction for a normal admixture (Figure 4B). In 
particular, a large peak near a corrected ratio of 0.5 
represents clonal deletions (Figure 4C). In addition, two 
overlapping, but distinct smaller peaks appear between 
the clonal deletions and regions of normal copy (Figure 
4D and 4E). These peaks represent two distinct subclo- 
nal populations in the tumor sample. A statistical test of 
the difference in read depth ratios between these peaks 
supports the conclusion that these subclonal populations 
are indeed distinct (see Additional file 1, Figure S15). 
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Figure 4 Analysis of the 40x coverage breast tumor PD4115a. (A) (Left) Read depth ratios (gray) and the inferred copy number aberrations 
calculated by our algorithm when n = 3 including the normal population (black), dominant (clonal) tumor population (blue) and subclonal 
tumor population (red). (Right) A reconstruction of the tumor mixture with the inferred aberrations and estimated fraction of cells in each 
subpopulation. (B) Distribution of read depth ratios over 50 kb intervals after centering and correction for 24% normal admixture using a simple 
inear scaling. Several peaks fall near to expected corrected ratios (0.5, 1, 1.5, 2). Two overlapping but distinct peaks can be seen indicating 
multiple subclonal deletions in similar proportions (labeled D and E). (C) (Top) Read depth ratios in 50 kb bins for chromosomes 5, 9 and 11, 
each of which has a clonal deletion (purple). (Bottom) Distribution of read depth ratios after correction for the aberration fraction of 75% of the 
sample. (D) (Top) Read depth ratios in 50 kb bins for chromosomes 3, 4 and 5, each of which has a subclonal deletion (blue). (Bottom) 
Distribution of read depth ratios after correction for the aberration fraction of 43.3% of the sample. (E) (Top) Read depth ratios as in (D), but a 
different subclonal deletion is highlighted (red). (Bottom) Distribution of read depth ratios after correction for the aberration fraction of 32.7% of 
the sample. 
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Virtual SNP array analysis of this sample was difficult 
due to the lower sequence coverage. This leads to over- 
lapping clusters in the read-count/variant-allele plot, as 
well as distinct banding resulting from the integrality of 
read counts (Figure 5A). The only clearly distinct clus- 
ters are for highly amplified regions, which have corre- 
spondingly higher read counts. Since our analysis for 



this model used only a subset of chromosome intervals 
to infer normal admixture and tumor subpopulations, 
we were able to use the resulting genome mixing vector 
to analyze other chromosomes that were not used in 
computing the maximum likelihood solution. We ana- 
lyzed several regions in chromosome 8 (Figure 5B), a 
chromosome with a complicated amplification pattern. 
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Figure 5 Analysis of chromosome 8 in sample PD4115a. (A) Virtual SNP array data from tliis sample show few distinct clusters (compared 
with the 188x sample in Figure 3A), with amplification of chromosome 8 (green) being the most prominent. (B) Read depth ratios for 
chromosome 8 organized by genomic coordinate. (C) Histograms of read depth ratios for chromosome 8 corrected for 24% normal admixture, 
indicating regions of copy numbers 2, 3 and 4 (cyan, orange and pink), with the latter two being clonal amplifications. (D) Variant allele 
frequencies for chromosome 8. The region with copy number 4 (pink) has variant allele frequencies clustered around 0.5, suggesting duplication 
of both chromosomal homologs, while the telomeric region with copy number 2 (cyan) has a loss of heterozygosity, suggesting a copy neutral 
LOH event. LOH: loss of heterozygosity 
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After correcting read depth ratios in 50 kb intervals in 
this region for the estimated normal admixture of 24%, 
three distinct peaks centered near ratios of 1, 1.5 and 2 
were apparent, corresponding to integer copy numbers 
of 2, 3 and 4, respectively, in the tumor sample (Figure 
5C). The amplifications are clonal aberrations. Interest- 
ingly, the variant allele frequencies for germline SNPs in 
the regions corresponding to the peak at corrected ratio 
2 (copy number 4) are centered at 0.5. This implies that 
both homologs of chr8ql3-21 are present at equal copy 
number in this region; that is, there is a duplication of 
both homologs (Figure 5D). In addition, we observed 
that the variant allele frequencies for chromosome 8p 
are centered at the values of 0 and 1, although this seg- 
ment of the chromosome is inferred to have copy num- 
ber 2. This indicates that there was a copy-neutral loss 
of heterozygosity (LOH) in this region. LOH in 8p has 
been previously reported in breast cancer [49,50] and 
copy neutral LOH in Bp has been reported in cell line 
data for other cancers [51]. 

Discussion 

We introduce Tumor Heterogeneity Analysis (THetA), 
an algorithm that infers the most likely collection of 
genomes and their proportions from high-throughput 
DNA sequencing data, in the case where copy number 
aberrations distinguish subpopulations. We show that 
THetA outperforms three other methods, CNAnorm 
[18], ASCAT [12] and ABSOLUTE [13], for inferring 
tumor purity and identifying copy number aberrations 
in the case of a single tumor cell population admixed 
with normal (non-cancerous) cells. Moreover, we 
demonstrate that THetA successfully estimates tumor 
purity even at low purity (10%) and with modest 
sequence coverages (approximately 30x) on both real 
and simulated data. In contrast to other recent methods 
[12,13] that first infer average ploidy across the genome, 
THetA simultaneously estimates tumor purity and com- 
putes the integral copy number of each genomic seg- 
ment/interval. These advantages result from THetA 
exploiting the large number of data points (reads) that 
measure copy number aberrations in high-throughput 
sequencing data - information that is not available from 
SNP arrays. 

We also demonstrate that THetA successfully decon- 
volves a tumor sample into a normal population and 
multiple tumor subpopulations, inferring the proportion 
of each subpopulation in the mixture, and partitioning 
copy number aberrations into clonal and subclonal 
populations. Other existing methods, such as ASCAT 
[12], ABSOLUTE [13] and CNAnorm [18], do not 
directly infer multiple subpopulations. Further, we show 
that these methods can produce highly inaccurate esti- 
mates of tumor purity on samples containing multiple 



subpopulations, and are sometimes unable to identify 
some copy number aberrations that occur in subpopula- 
tions of tumor cells. In addition, THetA reports all pos- 
sible solutions of interval count matrices C and genome 
mixing vectors ^ with the same maximum likelihood, 
allowing users to explore different maximum likelihood 
solutions. Thus, THetA is an attractive alternative to 
these methods. 

We demonstrated the advantages of THetA using 
three breast cancer genomes sequenced in [25]: one 
sequenced at approximately 188x coverage and two at 
approximately 40x coverage. Nik-Zainal et al. [25] 
showed how a large amount of information about a 
tumor's evolutional history can be derived by analyzing 
clonal and subclonal mutations in high-coverage 
sequencing data. Our THetA algorithm automates 
some of the manual analysis involved in such recon- 
structions. For the 188x genome, our results are lar- 
gely concordant with the extensive analysis and 
annotation of this sample in [25]. THetA automatically 
recovered nearly all of the copy number aberrations 
reported in [25], but with some differences in the clas- 
sification of aberrations as clonal or subclonal. Allele 
data not used by THetA provides external evidence 
that support the THetA results in several cases. On 
one of the 40x coverage genomes, we identified two 
previously unreported tumor subpopulations in nearly 
equal proportions, as well as a 24% normal admixture. 
These results are supported by statistical comparisons 
of read depth ratios, and also allowed us to identify 
copy-neutral LOH on chromosome 8q. Thus, we 
demonstrated that it is possible to identify multiple 
tumor populations successfully in a single sample by 
considering a subset of genomic intervals. Further, we 
did so for an approximately 40x sequenced tumor, 
demonstrating the ability to identify intra-tumor het- 
erogeneity at sequence coverages that are the current 
standard in cancer sequencing studies. 

THetA uses only read depth for inferring intra-tumor 
heterogeneity, in contrast to other methods [12,13,17,25] 
that use allele frequencies of heterozygous germline 
SNPs and somatic mutations. Since copy number aberra- 
tions - even those of a modest size - affect a large number 
of reads, THetA is able to infer multiple tumor subpopu- 
lations directly from sequencing data. However, THetA 
also has some limitations. First, the reliance on copy 
number aberrations means that THetA is unable to iden- 
tify tumor subpopulations that do not contain copy num- 
ber aberrations. As copy number aberrations are 
ubiquitous in many types of cancers, particularly solid 
tumors, we expect that THetA will prove useful for ana- 
lyzing a wide range of different cancer samples. Second, 
while the mathematical model used by THetA allows for 
any number of subpopulations, in practice the number of 
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subpopulations that can be correctly inferred depends on 
having at least one copy number aberration that distin- 
guishes every subpopulation. Finally, THetA's computa- 
tion time increases with an increasing number of 
subpopulations. 

Our focus in the development of THetA was to 
address rigorously the difficult problem of analyzing 
tumor purity and subclonal copy number aberrations 
from DNA sequencing data. A logical next step is to use 
the output from THetA to help predict single-nucleotide 
mutations in tumor samples and/or assess the clonality 
of somatic mutations, both challenging problems in 
their own right. Carter et al. [13] and Nik-Zainal et al. 
[25] show that once tumor purity is correctly estimated, 
then this value can be used to analyze the clonality/sub- 
clonality of somatic mutations. Incorporating the addi- 
tional signal of variant allele frequencies into the 
probabilistic model, as well as extending the model to 
allele-specific copy number changes [52], are important 
directions for future work. Ultimately, a desirable goal is 
to integrate into a single probabilistic framework the 
detection of all types of somatic aberrations (single 
nucleotide, copy number and rearrangements) with the 
estimation of tumor purity and the derivation of tumor 
subpopulations. Finally, further algorithmic improve- 
ments in THetA would help in the analysis of more 
complicated tumor samples that have more intervals 
(for example, smaller copy number aberrations), higher 
amplitude copy number aberrations, more subpopula- 
tions or more complicated rearrangements; for example, 
due to breakage/fusion/bridge (B/F/B) cycles [53], chro- 
mothripsis [54] or extrachromosomal amplifications 
[55]. THetA runs in polynomial time for a mixture of 
two genomes with intervals of equal weight, but the 
question of the complexity of the MLMDP for n > 2 
remains open. 

A number of other techniques have recently been used 
to study intra-tumor heterogeneity. For example [56] 
uses expression profiles across different individuals to 
identify differentially expressed genes with respect to 
healthy cells at the cancer site of origin. Single-cell 
sequencing and multi-region sequencing from a primary 
tumor are alternative strategies that have been success- 
fully employed [23-29]. As these technologies improve 
they will likely further contribute to our understanding 
of intra-tumor heterogeneity. However, sequencing of 
primary tumor samples as well as matched tumor/ 
metastasis samples will remain a dominant protocol for 
some time. Thus, algorithms, such as THetA, ABSO- 
LUTE, ASCAT and others, that can derive information 
about intra-tumor heterogeneity from DNA sequencing 
of tumor samples are a useful complement to other 
technologies and techniques for tumor heterogeneity 
studies. 



Conclusions 

Tumors are highly heterogeneous with individual cells 
in a tumor typically having different complements of 
somatic mutations. Highly accurate estimates of tumor 
purity and tumor subpopulation frequencies are neces- 
sary for investigating intra-tumor heterogeneity from 
single tumor samples. We introduce THetA, an algo- 
rithm that infers the most likely collection of genomes 
and their proportions from high-throughput DNA 
sequencing data, in the case where copy number aberra- 
tions distinguish subpopulations. We show the power of 
THetA with both simulated and real sequencing data - 
demonstrating the ability to identify intra-tumor hetero- 
geneity (in particular subclonal copy number aberra- 
tions) at modest sequence coverages (approximately 
40 x) that are the current standard in cancer sequencing 
studies. 

Materials and methods 

Intervals and counts: probability model 

In this section we derive the probability P{r\C, fi) in the 
maximum likelihood mixture decomposition problem 
(MLMDP). 
Single genome 

Following the usual assumptions (for example, the 
Lander-Waterman model), we assume that the starting 
positions of reads in a cancer genome are uniformly dis- 
tributed over its length. The probability of a read from a 
cancer genome aligning to an interval /y in the reference 
genome depends on: (i) the number of copies of the 
interval in the cancer genome, (ii) the length of the 
interval and (iii) possible difficulties in aligning reads to 
Ij due to repetitive sequence or other effects. We first 
describe the model under the simplifying assumption 
that there are no alignment difficulties and all the inter- 
vals in I are of equal length, which without loss of gen- 
erality we set to length 1. Below we show how to 
remove these restrictions by incorporating an interval 
weight vector w into the model, which assigns a weight 
to each interval in proportion to its length or mappabil- 
ity. Let c = (ci, c^) be the (unknown) number of 
copies of each interval in the cancer genome. Then the 
probability pj that a read aligns to Ij satisfies: 



where |c|i = YlJ^i '^j is the /i-norm of c. We use the 
notation: 



X 




to denote a normalized vector. Thus, the observed read 
depth vector r is the result of r = YlJ^i independent 
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draws from a multinomial distribution with parameter 
p = c; that is, r ~ Mult (r;c). 

We emphasize that the number of reads aligning to each 
interval are not independent random variables. As the total 
number of reads becomes extremely large, the multinomial 
distributions will converge to independent Poisson distribu- 
tions in each interval. However, even with the millions to 
billions of reads produced by high-throughput DNA 
sequencing, the effects of using a finite number of reads are 
an issue in cancer genome sequencing. This is because 
large copy number changes - including the gain and loss of 
whole chromosomes - are common in cancer genomes. A 
large deletion or duplication will affect the number of reads 
observed in other intervals; for example, if we consider the 
22 autosomes, a homozygous deletion of chromosome 
1 will reduce the effective length of the cancer genome by 
8.65%, altering the expected number of reads observed 
in other intervals (see Additional file 1, Figure SI). 
Mixture of genomes 

Now suppose we sequence a tumor sample f and align 
the obtained reads to the reference genome, observing a 
read depth vector r = (ri, r^^) e N'". Let C = [cy/,] be 
the (unknown) interval count matrix and ^ be the 
(unknown) genome mixing vector for the tumor sample. 
Here ^ is required to be an element of the unit (« - 1)- 



simplex a„ 



(Ml, ■ • ■ , l^nf e R"| ^ M/i = 1. and > 0 for all h ! 



Then the probability pj that a read aligns to Ij is the 
ratio of DNA in 7" from Ij compared to the total 
amount of DNA in the sample. That is: 



Pi 



(C/x)j 
ICmIi 



Therefore, r is the result of r = 2_^. ^ Tj draws from a 
multinomial distribution with parameter (Figure 1): 



p = C/x 



ICmIi 



That is r - Mult (r; C/x). 

Solving the maximum likelihood mixture decomposition 
problem 

We show here how to solve the MLMDP as a disjunc- 
tion of separate convex optimization problems. The 
negative log-likelihood of r as a function of the generic 
multinomial parameter p e A^-i is: 



£,(p) = -log(P(r|p)) = -log 



(1) 



where a is a constant, depending only on r. Finding 
the multinomial parameter p that minimizes this nega- 
tive log-likelihood function is straightforward. Using a 
Lagrange multiplier to encode the constraint p e A^-i, 
one determines that the (unique) value p" maximizing 
£, (p) satisfies: 

p1 = — 

Moreover, if the entries of r are integers (as they will 
be for read counts) and C is permitted to be any inte- 
ger-valued matrix, then the (unconstrained) solution: 



can be written in the form p = C/x (see Additional file 1, 
Section C). Thus, a solution of the MLMDP is obtained by 
maximizing the multinomial likelihood over all p e A^-i. 
Constraints on C 

In the Results section above, we described three natural 
constraints on the interval count matrix C. We define 
0.m,n to be pairs (C, n) where C satisfies the first two of 
those conditions: 



{(C, M) |ci = 2"", Cj e N'" for j > 1, /x e A„_ 



(2) 



Similarly, we define 0,m,n,k — ^m,n to be the pairs (C, 
fi) where C satisfies all three of the conditions: 

^m,n,k = I (C, lu) |ci = 2"', Cj e (0, . . . , k}'" for j> l,iie A„_i } (3) 

In the following, we will use CI to refer to either D.m,n 
or 0,f„,n,h as appropriate. Given a pair (C, pi) e CI, we 
define the negative log-likelihood of the observed read 
depth vector r using the multinomial model to be: 



Cr (C, 11) = - log (P (r|C, IJ.)} = -J2 log ((Cm),) + « 



(4) 



For an observed r, our goal is to find the C and ^ that 
minimize (4). We define the following optimization pro- 
blem where the domain of (C, fc) can be either of the 
domains CI defined above: 

(C*, n*) = argmin(c,^)ej2,Cr (C, fi) = argminjc^j^j^^Cr (Cfj.) (5) 

Since all entries of C are positive integers and all fij 
are positive reals, (5) is a mixed integer problem. In gen- 
eral, mixed integer linear programming (MILP) pro- 
blems are NP-hard to solve [57]. In our case, the 
objective function is a non-linear function of C and fi, 
meaning that even sophisticated MILP solvers are unli- 
kely to be much benefit for this problem. 
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A coordinate transformation 

Rather than attempting to solve the optimization pro- 
blem (5) as a generic MILP, we derive a coordinate 
transformation that allows us to solve this problem as a 
constrained optimization problem in El™. First, note that 
a pair (C, ^) e D defines a probability distribution qI- 
We define P^n = {Cfi\ (C, fi) € ^2] to he the space of all 
such probability distributions for all (C, pi) e Ci. Note 
that only C/i,i and not (C, fi), is identifiable from the 
observed data r. We prove the following theorem in 
Additional file 1, Section D. 

Theorem 1. Suppose p L Po, so p = C^, for some (C, 
fi) e Q.. Then there exists fi' L . ^ such that p = C/x' 
where C = (ci, . . . ,€„)■ 

Now suppose the interval count matrix C is fixed, and 

let H(C) = |c/x|/i, e A„_i| denote the set of convex 

combinations of the normalized column vectors in C. 
Then (5) reduces to the problem of finding argmin p ^ 
H(c) ^r{p)- Since the objective function £, (p) is separ- 
able convex (see Additional file 1, Section F) and the 
domain H{C) is convex, this problem is easy to solve 
using standard convex optimization routines. 

Let Cm,n,fe = {C| (C, fi) e fim,n,fe} be the set of interval 
count matrices C appearing in Cli^^y, /^. Considering all 
interval count matrices C e Cm.n.fe gives the following 
optimization problem: 



min/;r (p) subject to p e Ucgc„,,,aH (C) 



(6) 



Figure 6 illustrates the geometry of this optimization 
problem. Since in general a union of convex sets is not 



convex, the constraint set in (6) is not convex. A brute- 
force approach to this problem is to enumerate all C e 
C/z, but the number of such matrices is exponential in 
m and «. Note that in the Results section, THetA 
demonstrates improved performance in computing C 
and ^ when the number m of intervals increases in the 
case where n = 3. This is expected from the convex geo- 
metry used by our algorithm: for a fixed interval count 
matrix C, each value C/x> defines a 2-plane in . i 
(see Additional file 1, Figure S2). These planes become 
more sparse in A„ _ i as w increases, and thus our algo- 
rithm is less prone to overfitting. In the next section, we 
show that in the n = 2 case we can restrict the space of 
C matrices to a number that is polynomial in m. 

A more efficient algorithm for the MLMDP 

We derive an algorithm to solve the MLMDP (as formu- 
lated in (6)) that is polynomial time in m when n = 2. 
This algorithm relies on the observation that it is neces- 
sary to consider only a subset of interval count matrices 
C whose entries satisfy ordering constraints imposed by 
the read depth vector r. We say that two vectors a = 
(«i, Um) and b = {bi, b^) e R™ have compatible 
order if for all 1 < i, J < m, fl, < aj if and only if bi < bj. 
Note that the vector x = {s, s) e R™ for any s e R 
has compatible order with all vectors in R™. 

Theorem 2. Suppose p* = C*//* = argminp^p^^^ £r (p> 
Then we have the following: 

1. p* and r have compatible order. 

2. If n = 2 and fi2 > 0, then r and C2 have compatible 
order. 





0.2 0.4 0.6 0.8 1 



Figure 6 Convex geometry of the MLMDP used in the THetA algorithm. (Left) For a single cancer genome with normal admixture, the 
interval count vector C2 of the cancer genome and tumor purity fj define a collection of rays C/j, for /j g 0[1]. (Here we show the space O323). 
(Right) Normalizing these rays, we obtain the parameter p = (3^, used in the multinomial likelihood. These parameters are embedded in the 
simplex Am , i(gray triangle with a black outline) because their entries sum to one. (This is the space fj23 2 3-) For a fixed interval count matrix C 
= (c,, C2) a blue ray (left) defined by C/j is mapped to the corresponding red/green ray (right) connecting Ci to Q (right), the normalized 
columns of C, as described in Theorem 1. For n > 2, hyperplanes are mapped to hyperplanes (see Additional file 1, Figure S2). We show p* = f, 
the maximum likelihood solution when interval counts are not constrained to be integers. Note that this point is not on any of the rays defined 
by interval count matrices. Rays that satisfy the ordering constraint from Theorem 2 are in green. IVILMDP: maximum likelihood mixture 
decomposition problem 
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Theorem 2 (proof is in Additional file 1, Section E) 
leads to a more efficient algorithm where we evaluate 
only matrices C = (ci, C2) e Cm,2,k where C2 has compati- 
ble order with r. The number of such matrices is 0{m'^) 
and enumeration of these matrices is 0{m^*^) (see Addi- 
tional file 1, Section E). Note that if « > 3, that is, there is 
more than one cancer genome in the mixture, then the 
ordering constraints do provide some restrictions on the 
entries of C (see Additional file 1, Section E). The reduc- 
tion is not enough to make the space have polynomial 
size (in m), but the restrictions are useful in practice. 

Intervals of unequal length and mappability 

Thus far, we made the simplifying assumptions that all 
intervals in I are of equal length and that reads are 
aligned to each interval without any biases from the 
DNA sequence of the interval. Now we consider the 
general case where each interval Ij has an associated 
positive weight Wj. These weights can model both inter- 
val lengths as well as different mappability of intervals - 
that is, the probability of reads aligning uniquely to an 
interval in the reference genome can depend on the 
repeat content of the interval [58]. Let w = (wi, u'„,) 
be the interval weight vector. In practice, we use the 
read depth vector over I for the paired normal sample 
as w, which allows us to implicitly incorporate informa- 
tion on interval length, mappability and GC content 
into the model. 

Consider a single cancer genome where c = (ci, c,„) 
is the number of copies of each interval in the cancer 
genome. Then the probability pj of a read aligning to 
interval Ij in the reference genome is: 



the probability pj of a read aligning to interval Ij satis- 
fies: 



(=1 



I Well 



where W is a diagonal matrix such that Wjj = Wj. 
Therefore, the observed read depth vector r is obtained 
by r = X!;=i independent draws from a multinomial 
distribution with parameter: 



IWclJ 



I Well' 

We define the linear transformation O: R*" ^ R'" to 
be O (v) = Wv- Thus, p = 0(0) and r ~ Mult(r; 0(c)). 
As in the unweighted case above, if the entries in c are 
allowed to be arbitrary positive integers, then for any 
integer read depth vector r and non-negative weight 
vector w we can always find the maximum likelihood 
solution to the corresponding weighted MLMDP (see 
Additional file 1, Section G). 

Similarly, if we consider a tumor mixture 7" with 
interval count matrix C and genome mixing vector i^i, 



Pi 



" |WC/x|i 



o (Cm) 



^w,(C/x),- 
1=1 



Given a read depth vector r and an interval weight 
vector w, we formulate the analogous maximum likeli- 
hood mixture decomposition problem of identifying the 
underlying interval count matrix C and genome mixing 
vector fi that maximize the multinomial likelihood Mult 
(r|cD(C^)). 

Theorem 3 (see Additional file 1, Section G for the 
proof) relates the optimal (C, ^) in the cases of equal 
and unequal weighted intervals. 

Theorem 3. Let O'^ : R*" ^ R*" be <d— 1 (y) = 
We have the following set equality: 

argmin(c,;u)e£2,„,„-Cr (<I> (Cji)) = argmin(c,^)€S2„,,„'C4>-i(r) (C/x) 

Using this theorem, we find the optimal solution in 
the weighted interval case by solving the unweighted 
interval case; for example, using the techniques above. 
As stated. Theorem 3 applies to the case where (C, ^) e 
Clm,n (that is, the entries of C are unbounded). However, 
we can still leverage the logic behind this result when 
we add a restriction that C e Cm,2,k- While we do not 
expect that aTgmm^c,n)eS2,„j,k^T (C/x)) is equal to 
argmin(c,^)gn,„ j,,£<i>-i(r) (C/.t), we may assume that a 
solution to aigmm(c.ti)eQ„,,2,k^i (C/x)) will satisfy the 
same order constraints as £(j>-i(r) (C/x). Namely, we 
expect that the optimal solution will have compatible 
order with O' (r) (Theorem 2). This is because: (1) the 
unconstrained optima (when (C, fi) e Clm,n) for the two 
likelihood functions are equal, (2) the objective function 
£r(p) is well behaved (separable convex) and (3) the 
transformation O is linear. Thus, the optima in the con- 
strained weighted case cannot deviate too much from 
the optima in the constrained unweighted case, where 
the ordering conditions hold. Thus, we need only to 
consider C e Cm,2,k where C2 has compatible order with 
<I''^(r) to find an optimum. We verified this statement 
empirically over a variety of simulations (see Additional 
file 1, Section H). 

Model selection 

We use the Bayesian information criterion (BIC) to 
make a selection from different sized models (that is, 
different values of «) and their corresponding sets of 
maximum likelihood solutions. The standard form of 
the BIC is -2 log(Z,) + a log(&) where L is the likelihood 
of a solution, a is the number of free parameters in the 
model and b is the number of data points. We add a 
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slight modification to this, which is similar to a modifi- 
cation used by the segmentation algorithm BIC-Seq [32] 
that allows use of more stringently penalized solutions 
with more free parameters using a new parameter y. 
The motivation is that the BIC tends to be too liberal 
when the model space is large [59] - as is the case here. 
Values of y above 1 will penalize models that have more 
distinct tumor populations more strongly. That is, 
increasing this parameter will more strongly encourage 
solutions with fewer subpopulations. The default value 
of y is 10, and was chosen because we expect to recover 
a small number of distinct subpopulations from sequen- 
cing data - thus making penalization of models with 
more subpopulations attractive. Additionally, changing y 
in either direction (by up to 4) from this default value 
yields consistent results on the datasets analyzed. Our 
modified BIC is -2 log(Z-) + ya log(&), where a = (m + I) 
(k - 1) and b is the total number of reads in the inter- 
vals for both the tumor and normal samples. Since we 
often run the n = 3 version of the algorithm on a subset 
of the intervals used in the n = 2 algorithm, we use the 
following steps to determine which value of n to select. 

(1) Run the algorithm for n = 2 and n = 3 using the 
subset of intervals and the lower and upper bounds 
used for n = 3 and obtain respective likelihood values. 

(2) Compute the modified BIC for both values of n and 
choose the one with the lowest value. 

Sets of maximum likelihood solutions 

If (C, fi), (C, e D. such that = C'/x'' ^^^'^ ^^'^ ^'^Y 
observed read depth vector r, the likelihood of observing 
the r will be identical between these two solutions. That 

is, Cr (Cm) = Cr (CV') By default THetA will always 

output the complete set of maximum likelihood solu- 
tions to the MLMDP given the input parameters (for 
example, the maximum copy number k to consider). 
However, THetA has several options that allow a user 
to input additional information, like sample ploidy, 
which may be known in advance. One option allows a 
user to supply an expected ploidy for a sample (for 
example, 4 in the case of a tetraploid genome), and the 
lower and upper bounds considered for all intervals are 
rescaled to reflect this expected ploidy. Another option 
allows a user to set lower and upper bounds on copy 
numbers directly for all intervals in the genome. In 
either case, THetA will still output the complete set of 
maximum likelihood solutions that reflect the options 
supplied by the user. 

Code availability 

The THetA software is available for download from our 
website [60]. For a copy of the software at the time of 
publication please see Additional file 2, although we 



recommend that the latest version of THetA be down- 
loaded from [60]. 

Analysis of breast cancer genomes 

Here we provide additional details of the analysis of the 
breast cancer samples. 
Breast tumor: 188x sequence coverage 
For the n = 2 analysis of sample PD4120a, we used all 
genomic intervals derived following BIC-Seq segmenta- 
tion (A = 100) after removal of all intervals less than 50 
kb in length. For the n = 3 analysis, we selected a subset 
of these intervals by choosing: (1) all chromosomes that 
BIC-Seq partitioned into a single interval and (2) all 
intervals >22 Mb that were reported as having an abnor- 
mal copy number 2) in the n = 2 analysis. We used 
only the longest such interval per chromosome if the 
number of such intervals was large. We later added all 
intervals from chromosome 22 to this subset in order to 
resolve differences between our results and those pre- 
sented in [25]. Since the results for both subsets were 
extremely similar, we present here the results for the lar- 
ger subset (including chromosome 22). Results for the 
smaller subset are given in Additional file 1, Figure S9. 
Breast tumor: 40x sequence coverage 

For the n = 2 analysis of sample PD4115a, we used all 
genomic intervals derived from BIC-Seq segmentation (A 
= 200) after removal of all intervals 50 kb in length. We 
found that PD4115a contains many apparent copy number 
aberrations with the segmentation containing 102 intervals 
(compared to only 69 intervals for sample PD4120a 
above). In addition, this sample also includes several highly 
amplified regions, and no chromosome was segmented 
into a single interval. Thus, we ran THetA for « = 3 on a 
subset of the longest intervals in the BIC-Seq partition, 
and set lower and upper bounds on the copy number for 
each interval (see Additional file 1, Section N). 
Virtual SNP arrays 

To compare those of our predictions that differed from 
those presented in [25], we looked at known germline 
SNP allele frequencies derived directly from the sequen- 
cing data - a virtual SNP array. We emphasize that this 
data was not used by our THetA algorithm for comput- 
ing tumor heterogeneity, and therefore this provides 
independent data for validation. We looked at read cov- 
erage and variant allele frequency for the 907,693 SNP 
positions on the 22 autosomes tested by the Affymetrix 
6.0 SNP array (SNP positions and major and minor 
alleles for hgl9 determined using the UCSC genome 
browser [61]). The read coverage for a SNP position is 
the number of concordant reads with mapping quality 
>30 that have an alignment containing either the major 
or minor allele at the SNP position. The variant allele 
fraction, or BAF, is the fraction of such reads that con- 
tains the minor allele. 
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Additional material 



Additional file 1: Figures and text describing additional information 
such as proofs of theorems or additional experimental results 

Additional file 2: THetA software package at the time of 
publication. In general, it is recommended that the latest version of 
THetA be downloaded from [60]. 
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